Unsupervised Learning Primer: Dimension Reduction & Clustering

Abstract

The goal of this article is to explore unsupervised learning methods in Data Science. I am currently (as of Jan. 2020) enrolled in the first semester of the MSc in Data Science at the University of Warsaw, ranked CEE’s Best Data Science Programme by Eduniversal. By now I can tell that topics like dimension reduction (e.g. MDS, PCA, t-SNE) and clustering (kMeans, PAM, DBSCAN) are fundamental skills to acquire if you wish to have a successful career in Data Science or Machine Learning.

Datasets tend to be unlabeled, noisy or simply too big. Given the first rule of modelling, “garbage in, garbage out”, you need to master data preparation in order to build the best models and give people the best recommendations. Here I will work on the well-known Wine Quality dataset from the UCI Machine Learning Repository, also available on Kaggle. Partly because I am a huge fan of Sauvignon Blanc and Chardonnay, and partly because I think this data is great for explaining how the aforementioned techniques work. This Primer aims to deliver best practices in this area and to serve as a one-stop shop for the basics of unsupervised learning.

Happy Reading!

Exploratory Data Analysis

This dataset consists of several numeric variables that describe each wine, and two categorical variables that classify it. I will be using techniques that help label the data and gather observations into similar groups (similar points end up close together on a hyperplane). That is why I will not use the color and quality labels as inputs for the models. Nevertheless, during evaluation it will be helpful to check how well the wines get grouped. In a perfect solution we would end up with two main groups, each dominated by a single color, and within those two groups there would be visible structures mirroring the quality of the wines.

To sum up, the wines are currently categorized by:

- the type (red or white),
- the quality (an integer score from 3 to 9).

However, I will analyse only the numeric variables, as you can see below.

The dataset will be split into a training and a testing part, so that we can check the best models during evaluation and answer the most important question (from a business perspective): “how well does our model apply to new data?”.

wine.data = read.csv(file = "WineQuality.csv", header = T)

paste0("Number of rows, including NAs: ", dim(wine.data)[1])
## [1] "Number of rows, including NAs: 6497"
wine.data = wine.data %>%
  na.omit()

paste0("Number of rows, excluding NAs: ", dim(wine.data)[1])
## [1] "Number of rows, excluding NAs: 6463"
wine.data = wine.data %>%
  as.data.table()

var.list = colnames(wine.data[,2:(ncol(wine.data)-1)])

paste0(var.list)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"
str(wine.data)
## Classes 'data.table' and 'data.frame':   6463 obs. of  13 variables:
##  $ type                : Factor w/ 2 levels "red","white": 2 2 2 2 2 2 2 2 2 2 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  - attr(*, "na.action")= 'omit' Named int  18 34 55 87 99 140 175 225 250 268 ...
##   ..- attr(*, "names")= chr  "18" "34" "55" "87" ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(wine.data)
##     type      fixed.acidity    volatile.acidity  citric.acid    
##  red  :1593   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  white:4870   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##               Median : 7.000   Median :0.2900   Median :0.3100  
##               Mean   : 7.218   Mean   :0.3396   Mean   :0.3188  
##               3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##               Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.444   Mean   :0.05606   Mean   : 30.52     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.218   Mean   :0.5311  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.30   Median :6.000  
##  Mean   :10.49   Mean   :5.819  
##  3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :9.000
set.seed(1729)
sample.size = floor(0.80 * nrow(wine.data))

train.ind = sample(seq_len(nrow(wine.data)), size = sample.size)

wine.test = wine.data[-train.ind, ]
wine.data = wine.data[ train.ind, ]

summary(wine.test) #dataset for testing
##     type     fixed.acidity   volatile.acidity  citric.acid    
##  red  :331   Min.   : 4.40   Min.   :0.0800   Min.   :0.0000  
##  white:962   1st Qu.: 6.40   1st Qu.:0.2300   1st Qu.:0.2400  
##              Median : 7.00   Median :0.2900   Median :0.3100  
##              Mean   : 7.26   Mean   :0.3433   Mean   :0.3212  
##              3rd Qu.: 7.70   3rd Qu.:0.4100   3rd Qu.:0.4000  
##              Max.   :15.50   Max.   :1.5800   Max.   :0.9900  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.700   Min.   :0.01300   Min.   :  3.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 2.900   Median :0.04800   Median : 29.00     
##  Mean   : 5.496   Mean   :0.05648   Mean   : 30.74     
##  3rd Qu.: 8.000   3rd Qu.:0.06700   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.41500   Max.   :146.50     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9882   Min.   :2.740   Min.   :0.2200  
##  1st Qu.: 81.0        1st Qu.:0.9924   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :121.0        Median :0.9949   Median :3.200   Median :0.5100  
##  Mean   :116.7        Mean   :0.9949   Mean   :3.221   Mean   :0.5386  
##  3rd Qu.:156.0        3rd Qu.:0.9972   3rd Qu.:3.330   3rd Qu.:0.6200  
##  Max.   :307.5        Max.   :1.0390   Max.   :4.010   Max.   :1.9500  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.30   Median :6.000  
##  Mean   :10.46   Mean   :5.839  
##  3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :14.00   Max.   :8.000
summary(wine.data) #dataset for training
##     type      fixed.acidity    volatile.acidity  citric.acid    
##  red  :1262   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  white:3908   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##               Median : 7.000   Median :0.2900   Median :0.3100  
##               Mean   : 7.207   Mean   :0.3386   Mean   :0.3182  
##               3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##               Max.   :15.900   Max.   :1.3300   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.431   Mean   :0.05595   Mean   : 30.46     
##  3rd Qu.: 8.100   3rd Qu.:0.06400   3rd Qu.: 41.00     
##  Max.   :26.050   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2300  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5000  
##  Mean   :115.4        Mean   :0.9947   Mean   :3.218   Mean   :0.5293  
##  3rd Qu.:155.8        3rd Qu.:0.9969   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol        quality     
##  Min.   : 8.0   Min.   :3.000  
##  1st Qu.: 9.5   1st Qu.:5.000  
##  Median :10.3   Median :6.000  
##  Mean   :10.5   Mean   :5.813  
##  3rd Qu.:11.3   3rd Qu.:6.000  
##  Max.   :14.9   Max.   :9.000

It is important to do the sampling in the right way, meaning we should obtain, e.g., the same proportion of white and red wines in the two subsets. As we see above from the summary() output, the proportions are nearly identical.
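Here the simple random split happened to preserve the proportions; a stratified split guarantees it by sampling within each type separately. A minimal base-R sketch (the helper name `stratified.split` is mine, not from any package):

```r
# Stratified 80/20 split: sample within each wine type separately,
# so the red/white proportion is preserved exactly in both subsets.
set.seed(1729)
stratified.split = function(data, strata, frac = 0.80) {
  train.idx = unlist(lapply(split(seq_len(nrow(data)), strata), function(idx) {
    sample(idx, size = floor(frac * length(idx)))
  }))
  list(train = data[train.idx, , drop = FALSE],
       test  = data[-train.idx, , drop = FALSE])
}

# Usage on a toy data.frame mimicking the wine data:
toy = data.frame(type = rep(c("red", "white"), c(100, 300)), x = rnorm(400))
parts = stratified.split(toy, toy$type)
table(parts$train$type)  # 80 red, 240 white
```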


From the first glimpse at the data I can tell that there are a lot of outliers (long right tails of the densities), and that the distributions of each variable differ substantially between white and red wines. That is why I decided to analyze the two types side by side, as well as the full dataset for comparison. Dimension reduction and clustering are all about distances: how much variability the variables carry and how close the data points are to each other. Given that, not only is dividing the data into “red” and “white” subsets advisable, but scaling is mandatory as well.

wine.red = wine.data %>%
  filter(type == "red") %>%
  within(rm("type", "quality")) %>%
  scale() %>%
  as.data.table()

wine.whi = wine.data %>%
  filter(type == "white") %>%
  within(rm("type", "quality")) %>%
  scale() %>%
  as.data.table()

wine.all = wine.data %>%
  within(rm("type", "quality")) %>%
  scale() %>%
  as.data.table()

colnames(wine.red) = colnames(wine.whi) = colnames(wine.all) = var.list

wine.all.noID = wine.all
wine.red.noID = wine.red
wine.whi.noID = wine.whi

wine.red$ID = seq.int(nrow(wine.red))
wine.whi$ID = seq.int(nrow(wine.whi))
wine.all$ID = seq.int(nrow(wine.data))
corrplot::corrplot(cor(wine.all.noID),
                   method = "color", col = viridis(10),
                   type = "upper", order = "hclust",
                   addCoef.col = "black", tl.col = "black", tl.pos = "lt", tl.srt = 45,
                   sig.level = .05, insig = "blank",
                   diag = F, title = "Correlation plot of the variables for All Wines")
corrplot::corrplot(cor(wine.red.noID),
                   method = "color", col = viridis(10),
                   type = "upper", order = "hclust",
                   addCoef.col = "black", tl.col = "black", tl.pos = "lt", tl.srt = 45,
                   sig.level = .05, insig = "blank",
                   diag = F, title = "Correlation plot of the variables for Red Wines")
corrplot::corrplot(cor(wine.whi.noID),
                   method = "color", col = viridis(10),
                   type = "upper", order = "hclust",
                   addCoef.col = "black", tl.col = "black", tl.pos = "lt", tl.srt = 45,
                   sig.level = .05, insig = "blank",
                   diag = F, title = "Correlation plot of the variables for White Wines")

wine.all.melted = melt(wine.all[,1:(ncol(wine.all)-1)])
colnames(wine.all.melted) = c("Variable", "Value")

ggplot(wine.all.melted, aes(x = Variable, y = Value))+
  geom_boxplot(outlier.alpha = .5, 
               color = viridis(11)) +
  ggtitle("Cumulative Box-Plot for (scaled) Variables in all Wines")

wine.red.melted = melt(wine.red[,1:(ncol(wine.red)-1)])
colnames(wine.red.melted) = c("Variable", "Value")

ggplot(wine.red.melted, aes(x = Variable, y = Value))+
  geom_boxplot(outlier.alpha = .5, 
               color = viridis(11)) +
  ggtitle("Cumulative Box-Plot for (scaled) Variables in red Wines")

wine.whi.melted = melt(wine.whi[,1:(ncol(wine.whi)-1)])
colnames(wine.whi.melted) = c("Variable", "Value")

ggplot(wine.whi.melted, aes(x = Variable, y = Value))+
  geom_boxplot(outlier.alpha = .5, 
               color = viridis(11)) +
  ggtitle("Cumulative Box-Plot for (scaled) Variables in white Wines")

At this point, what I would like to make you aware of are the outliers. As I mentioned, the PCA or kMeans magic is based on distances, and both are linear approaches. That is why you should always take a step back when analysing your outliers. As we treat this article as a handbook, we will skip the outlier removal step and see whether the results on the full dataset are robust. For the sake of curiosity, if I were to remove outliers, I would use the Tukey rule of thumb: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR. There are multiple ways to detect and delete them, perhaps with functions from extra packages, but a quick win is a simple nested loop over variables and observations.
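The nested loop I have in mind might look like the sketch below: compute Tukey's fences per variable, then flag every observation falling outside them in any column (an illustration only, not code used elsewhere in this article):

```r
# Drop rows falling outside Tukey's fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
# in any numeric column, using a plain nested loop.
remove.outliers = function(df) {
  keep = rep(TRUE, nrow(df))
  for (j in seq_len(ncol(df))) {            # outer loop over variables
    x = df[[j]]
    q = quantile(x, c(0.25, 0.75))
    lo = q[1] - 1.5 * IQR(x)
    hi = q[2] + 1.5 * IQR(x)
    for (i in seq_len(nrow(df))) {          # inner loop over observations
      if (x[i] < lo || x[i] > hi) keep[i] = FALSE
    }
  }
  df[keep, , drop = FALSE]
}

# Toy check: the single extreme value in column a gets dropped
toy = data.frame(a = c(rnorm(50), 100), b = rnorm(51))
nrow(remove.outliers(toy))  # at most 50 rows remain
```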

Dimension Reduction: PCA

In this section I will run parallel PCAs on all wines, as well as on the red and the white subsets. Logically, different types of wine should have different drivers. Moreover, PCA is a great step before moving into clustering. In my opinion PCA is somehow related to feature engineering: we create 11 principal components (new variables), which can sometimes be interpreted theoretically and practically.

all.pca.1 = prcomp(wine.all.noID, center=TRUE, scale.=TRUE)
whi.pca.1 = prcomp(wine.whi.noID, center=TRUE, scale.=TRUE)
red.pca.1 = prcomp(wine.red.noID, center=TRUE, scale.=TRUE)
summary(all.pca.1)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Standard deviation     1.7388 1.5828 1.2497 0.98852 0.84039 0.7820 0.72505
## Proportion of Variance 0.2748 0.2278 0.1420 0.08883 0.06421 0.0556 0.04779
## Cumulative Proportion  0.2748 0.5026 0.6446 0.73340 0.79761 0.8532 0.90100
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.70301 0.58355 0.47427 0.17132
## Proportion of Variance 0.04493 0.03096 0.02045 0.00267
## Cumulative Proportion  0.94593 0.97688 0.99733 1.00000
summary(whi.pca.1)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.8064 1.2519 1.1030 1.01748 0.98838 0.96089
## Proportion of Variance 0.2967 0.1425 0.1106 0.09412 0.08881 0.08394
## Cumulative Proportion  0.2967 0.4391 0.5497 0.64384 0.73265 0.81659
##                            PC7     PC8     PC9    PC10    PC11
## Standard deviation     0.84338 0.77778 0.63751 0.53018 0.11746
## Proportion of Variance 0.06466 0.05499 0.03695 0.02555 0.00125
## Cumulative Proportion  0.88125 0.93625 0.97319 0.99875 1.00000
summary(red.pca.1)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7607 1.3811 1.2664 1.1014 0.97838 0.81206 0.75369
## Proportion of Variance 0.2818 0.1734 0.1458 0.1103 0.08702 0.05995 0.05164
## Cumulative Proportion  0.2818 0.4552 0.6010 0.7113 0.79834 0.85829 0.90993
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.63843 0.58829 0.42015 0.24615
## Proportion of Variance 0.03705 0.03146 0.01605 0.00551
## Cumulative Proportion  0.94698 0.97844 0.99449 1.00000

Taking a closer look, we can see that the first component explains roughly 30% of the variance. Scree plots will help visualize the marginal variance gained with each next component.

plot(all.pca.1, type = "l", main = "Scree Plot for All Wines")
plot(whi.pca.1, type = "l", main = "Scree Plot for White Wines")
plot(red.pca.1, type = "l", main = "Scree Plot for Red Wines")

Now it is clear that we can extract the best results from the white wines: the first component explains a lot of variance, and each additional one adds only marginally to the cumulative total. Less obvious are the red wines and the All Wines dataset. We can deduce that the red type is more complex, and more factors are needed to explain it.

fviz_pca_var(all.pca.1) + ggtitle("Directions for the first two PC in All Wines")
fviz_pca_var(whi.pca.1) + ggtitle("Directions for the first two PC in White Wines")
fviz_pca_var(red.pca.1) + ggtitle("Directions for the first two PC in Red Wines")

While analysing those plots, we need to keep in mind that we rely on only a part of the variance explained; however, in most cases it can be interpreted as a good proxy for the whole dataset. We know that certain principal components capture a finite share of variability in the dataset, but which variables should we choose from the initial list of 11 to perform the best clustering? Let us then take a closer look at the contribution of each variable to each of the first 4 principal components.

You might also take a look at the contributions of individual observations, but with a dataset of this size it would not be efficient.
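For intuition, what fviz_contrib reports for a variable is essentially its squared loading on the component, normalized so the contributions sum to 100%. A small base-R sketch on the built-in mtcars data (the helper `pc.contrib` is mine, for illustration):

```r
# Variable contribution (%) to one principal component:
# squared loading on that PC, normalized to sum to 100.
pc.contrib = function(pca, axis = 1) {
  load2 = pca$rotation[, axis]^2
  100 * load2 / sum(load2)
}

pca = prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)
round(sort(pc.contrib(pca, axis = 1), decreasing = TRUE), 1)
```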

fviz_contrib(all.pca.1, "var", axes = 1)
fviz_contrib(all.pca.1, "var", axes = 2)

fviz_contrib(all.pca.1, "var", axes = 3)
fviz_contrib(all.pca.1, "var", axes = 4)

fviz_contrib(whi.pca.1, "var", axes = 1)
fviz_contrib(whi.pca.1, "var", axes = 2)

fviz_contrib(whi.pca.1, "var", axes = 3)
fviz_contrib(whi.pca.1, "var", axes = 4)

fviz_contrib(red.pca.1, "var", axes = 1)
fviz_contrib(red.pca.1, "var", axes = 2)

fviz_contrib(red.pca.1, "var", axes = 3)
fviz_contrib(red.pca.1, "var", axes = 4)

Finally, we might take a look at the variance-maximizing rotation (varimax), with an a priori assumption on the number of factors (we decided that 4 maximizes our marginal gain from each next factor). By analyzing how our variables behave in those components, we can make the final decision on the variables that will stay for the clustering stage. For the sake of curiosity, you might also want to minimize the number of factors given an assumed minimal level of variance (quartimax).

varimax.all = principal(wine.all.noID, nfactors = 4, rotate = "varimax")
varimax.whi = principal(wine.whi.noID, nfactors = 4, rotate = "varimax")
varimax.red = principal(wine.red.noID, nfactors = 4, rotate = "varimax")
print(loadings(varimax.all), digits = 2, cutoff = .4, sort = T)
## 
## Loadings:
##                      RC1   RC2   RC3   RC4  
## volatile.acidity     -0.63                  
## free.sulfur.dioxide   0.83                  
## total.sulfur.dioxide  0.83                  
## residual.sugar        0.46  0.64            
## density                     0.90            
## alcohol                    -0.83            
## fixed.acidity        -0.50        0.61      
## citric.acid                       0.79      
## pH                               -0.73      
## chlorides                               0.60
## sulphates                               0.86
## 
##                 RC1  RC2  RC3  RC4
## SS loadings    2.47 2.28 1.70 1.62
## Proportion Var 0.22 0.21 0.15 0.15
## Cumulative Var 0.22 0.43 0.59 0.73
print(loadings(varimax.whi), digits = 2, cutoff = .4, sort = T)
## 
## Loadings:
##                      RC1   RC2   RC4   RC3  
## residual.sugar        0.80                  
## free.sulfur.dioxide   0.63                  
## total.sulfur.dioxide  0.76                  
## density               0.90                  
## alcohol              -0.76                  
## fixed.acidity               0.80            
## citric.acid                 0.56  0.42      
## pH                         -0.74            
## sulphates                         0.72      
## volatile.acidity                        0.75
## chlorides                         0.50  0.52
## 
##                 RC1  RC2  RC4  RC3
## SS loadings    3.09 1.68 1.16 1.16
## Proportion Var 0.28 0.15 0.11 0.11
## Cumulative Var 0.28 0.43 0.54 0.64
print(loadings(varimax.red), digits = 2, cutoff = .4, sort = T)
## 
## Loadings:
##                      RC1   RC2   RC3   RC4  
## fixed.acidity         0.91                  
## citric.acid           0.76       -0.43      
## density               0.75        0.48      
## pH                   -0.74                  
## free.sulfur.dioxide         0.88            
## total.sulfur.dioxide        0.88            
## volatile.acidity                  0.71      
## alcohol                          -0.80      
## chlorides                               0.80
## sulphates                               0.76
## residual.sugar              0.46            
## 
##                 RC1  RC2  RC3  RC4
## SS loadings    2.86 1.79 1.72 1.46
## Proportion Var 0.26 0.16 0.16 0.13
## Cumulative Var 0.26 0.42 0.58 0.71

Clustering: DBSCAN

Before we cluster the initial datasets, we might want to exercise the results of PCA a little more. DBSCAN is a great tool for reducing noise in your data. In kMeans or PAM you will always classify all data points. I mentioned outliers before: it is not always the case that you would like to group every data point. Sometimes, by excluding points far away from the median/mean or center of mass of your dataset, you may obtain a better fit. Of course, you lose some information. There are use cases where you want to explain black swans and measure risk by understanding the density of the outliers, but there are also use cases where you want to apply your knowledge only to the “business as usual”, ordinary, standard data points.

Applying PCA to our multidimensional dataset, we traded information for explainability. By giving up around 50% of the variance, we are now able to visualize our data in 2 dimensions, which sometimes means a lot. Using DBSCAN, we will now stress-test the hypothesis that we can classify the wines relying only on those two artificial axes (which jointly explain around 50% of the variance).

varimax.all.2 = principal(wine.all.noID, nfactors = 2, rotate = "varimax")
kNNdistplot(varimax.all.2$scores, k = 5)
abline(h = 0.50, lty = 2)
abline(h = 1.00, lty = 2)
abline(h = 1.50, lty = 2)
abline(h = 2.00, lty = 2)
abline(h = 2.50, lty = 2)
db.all = dbscan(varimax.all.2$scores, eps = 0.25, minPts = 5)
fviz_cluster(db.all, varimax.all.2$scores, geom = "point")

varimax.whi.2 = principal(wine.whi.noID, nfactors = 2, rotate = "varimax")
kNNdistplot(varimax.whi.2$scores, k = 5)
abline(h = 0.50, lty = 2)
abline(h = 1.00, lty = 2)
abline(h = 1.50, lty = 2)
abline(h = 2.00, lty = 2)
abline(h = 2.50, lty = 2)
db.whi = dbscan(varimax.whi.2$scores, eps = 0.25, minPts = 5)
fviz_cluster(db.whi, varimax.whi.2$scores, geom = "point")

varimax.red.2 = principal(wine.red.noID, nfactors = 2, rotate = "varimax")
kNNdistplot(varimax.red.2$scores, k = 5)
abline(h = 0.50, lty = 2)
abline(h = 1.00, lty = 2)
abline(h = 1.50, lty = 2)
abline(h = 2.00, lty = 2)
abline(h = 2.50, lty = 2)
db.red = dbscan(varimax.red.2$scores, eps = 0.30, minPts = 5)
fviz_cluster(db.red, varimax.red.2$scores, geom = "point")

The results of DBSCAN are interesting and helpful in assessing the quality of the 1st and 2nd PCs. We can clearly see that multiple points were identified as outliers, which means that those 2 principal components are not enough to differentiate between the points in our dataset. There are also several groups of just a few points, which could be interesting if we wanted to research the data more deeply.

However, given that the biggest cluster contains nearly all points, DBSCAN is perhaps not the best approach here.

Clustering: kMeans

We first check the optimal number of clusters, based on the explained variance and the silhouette. From the results below we conclude that for the full dataset the optimal division is into 2 or 3 clusters. As we would like to test the hypothesis stated at the beginning, we will perform 2-group clustering for this dataset. The white wines could be divided into 3 groups, and the red ones into 6 or 7.
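For reference, the “variance explained” criterion is closely related to the between-/total sum-of-squares ratio that base kmeans() reports directly; a quick sketch on the built-in iris data (numbers come from iris, not from the wine set):

```r
# Share of total variance captured by a k-means solution:
# between-cluster sum of squares over total sum of squares.
set.seed(1729)
X = scale(as.matrix(iris[, 1:4]))
explained = sapply(1:6, function(k) {
  km = kmeans(X, centers = k, nstart = 20)
  km$betweenss / km$totss
})
round(explained, 3)  # grows with k; look for the elbow
```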

Optimal_Clusters_KMeans(wine.all.noID, max_clusters = 6, criterion = "variance_explained")
## [1] 1.0000000 0.7867291 0.6348875 0.5674126 0.5304595 0.5049044
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.all.noID, max_clusters = 6, criterion = "silhouette")
## [1] 0.0000000 0.2512374 0.2299682 0.2349282 0.1848423 0.1723347
## attr(,"class")
## [1] "k-means clustering"

Optimal_Clusters_KMeans(wine.whi.noID, max_clusters = 9, criterion = "variance_explained")
## [1] 1.0000000 0.7850155 0.7209216 0.6793957 0.6198868 0.5905757 0.5690738
## [8] 0.5420283 0.5193106
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.whi.noID, max_clusters = 9, criterion = "silhouette")
## [1] 0.0000000 0.2129875 0.1448907 0.1293172 0.1670323 0.1499838 0.1366008
## [8] 0.1318794 0.1341912
## attr(,"class")
## [1] "k-means clustering"

Optimal_Clusters_KMeans(wine.red.noID, max_clusters = 9, criterion = "variance_explained")
## [1] 1.0000000 0.8126158 0.7282846 0.6494929 0.6195329 0.5720478 0.5407613
## [8] 0.5212080 0.5070783
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.red.noID, max_clusters = 9, criterion = "silhouette")
## [1] 0.0000000 0.1940096 0.1626936 0.1571261 0.1115889 0.1239334 0.1416248
## [8] 0.1173657 0.1150920
## attr(,"class")
## [1] "k-means clustering"

Below we cluster the data, visualize the clusters and check the silhouette. I use both the Euclidean and the Manhattan metric, to compare which one deals better with the high dimensionality.

kmeans.euc.all = eclust(wine.all.noID, "kmeans", hc_metric = "euclidean", k = 2)
fviz_cluster(kmeans.euc.all)
fviz_silhouette(kmeans.euc.all)
##   cluster size ave.sil.width
## 1       1 3444          0.13
## 2       2 1726          0.30

kmeans.man.all = eclust(wine.all.noID, "kmeans", hc_metric = "manhattan", k = 2)
fviz_cluster(kmeans.man.all)
fviz_silhouette(kmeans.man.all)
##   cluster size ave.sil.width
## 1       1 3444          0.13
## 2       2 1726          0.30

kmeans.euc.whi = eclust(wine.whi.noID, "kmeans", hc_metric = "euclidean", k = 2)
fviz_cluster(kmeans.euc.whi)
fviz_silhouette(kmeans.euc.whi)
##   cluster size ave.sil.width
## 1       1 1545          0.19
## 2       2 2363          0.24

kmeans.man.whi = eclust(wine.whi.noID, "kmeans", hc_metric = "manhattan", k = 2)
fviz_cluster(kmeans.man.whi)
fviz_silhouette(kmeans.man.whi)
##   cluster size ave.sil.width
## 1       1 1545          0.19
## 2       2 2363          0.24

kmeans.euc.red = eclust(wine.red.noID, "kmeans", hc_metric = "euclidean", k = 2)
fviz_cluster(kmeans.euc.red)
fviz_silhouette(kmeans.euc.red)
##   cluster size ave.sil.width
## 1       1  823          0.26
## 2       2  439          0.14

kmeans.man.red = eclust(wine.red.noID, "kmeans", hc_metric = "manhattan", k = 2)
fviz_cluster(kmeans.man.red)
fviz_silhouette(kmeans.man.red)
##   cluster size ave.sil.width
## 1       1  823          0.26
## 2       2  439          0.14

The results are quite interesting. The silhouette within the clusters is (somewhat) sufficient. It is not the highest, but given the multidimensionality, and the fact that even wines of the same colour and quality can differ, I am happy with the result.
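As a reminder of what the average silhouette width measures: for each point, a is the mean distance to its own cluster, b is the mean distance to the nearest other cluster, and s = (b - a) / max(a, b). A base-R sketch on two toy blobs (the function name `silhouette.widths` is mine):

```r
# Silhouette width from first principles: s = (b - a) / max(a, b).
# Values near 1 indicate tight, well-separated clusters.
silhouette.widths = function(X, cluster) {
  D = as.matrix(dist(X))
  sapply(seq_len(nrow(X)), function(i) {
    a = mean(D[i, cluster == cluster[i] & seq_len(nrow(X)) != i])
    b = min(sapply(setdiff(unique(cluster), cluster[i]),
                   function(k) mean(D[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

set.seed(1)
X = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
          matrix(rnorm(40, mean = 5), ncol = 2))
cl = rep(1:2, each = 20)
mean(silhouette.widths(X, cl))  # close to 1 for well-separated blobs
```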

comPart(
  cclust(wine.all.noID, 2, dist = "euclidean"),
  cclust(wine.all.noID, 2, dist = "manhattan"))
##       ARI        RI         J        FM 
## 0.8806309 0.9410980 0.8996105 0.9471581
comPart(
  cclust(wine.whi.noID, 2, dist = "euclidean"),
  cclust(wine.whi.noID, 2, dist = "manhattan"))
##       ARI        RI         J        FM 
## 0.8333470 0.9167623 0.8509426 0.9194973
comPart(
  cclust(wine.red.noID, 2, dist = "euclidean"),
  cclust(wine.red.noID, 2, dist = "manhattan"))
##       ARI        RI         J        FM 
## 0.8182864 0.9093618 0.8409772 0.9136307
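
As a reminder, the (unadjusted) Rand Index above is the fraction of point pairs on which two partitions agree, i.e. pairs put together in both or kept apart in both. A minimal base-R sketch (`rand.index` is an illustrative helper, not the flexclust implementation):

```r
# Rand Index between two cluster labelings: fraction of unordered
# point pairs treated the same way by both partitions.
rand.index = function(c1, c2) {
  n = length(c1)
  same1 = outer(c1, c1, "==")
  same2 = outer(c2, c2, "==")
  agree = (same1 == same2)
  # count each unordered pair once, excluding the diagonal
  sum(agree[upper.tri(agree)]) / choose(n, 2)
}

rand.index(c(1, 1, 2, 2), c(1, 1, 2, 2))  # 1: identical partitions
rand.index(c(1, 1, 2, 2), c(2, 2, 1, 1))  # 1: same partition, relabeled
```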

Clusterings with different distance metrics are rather stable, as the Rand Index and similar measures indicate. We are working on highly heterogeneous data here, especially in the All Wines dataset, thus I am happy with the results. Below is a set of plots visualizing each observation's distance from its cluster centroid. The left plot of each pair uses the Euclidean metric, the right one the Manhattan metric. The first pair covers all data, the second the white wines, and the last pair the red wine segmentation. Bear in mind the different y-axis limits.

stripes(cclust(wine.all.noID, 2, dist = "euclidean"))
stripes(cclust(wine.all.noID, 2, dist = "manhattan"))

stripes(cclust(wine.whi.noID, 2, dist = "euclidean"))
stripes(cclust(wine.whi.noID, 2, dist = "manhattan"))

stripes(cclust(wine.red.noID, 2, dist = "euclidean"))
stripes(cclust(wine.red.noID, 2, dist = "manhattan"))

Evaluation

Now some might ask: “so what?”. We have done variable selection, investigating each variable by its importance in terms of the variance it adds. We have clustered the data in multiple ways and evaluated the quality of the clustering.

Given that we spent so much time exploring and tweaking the models, it would be good to reuse them in the future and perhaps gain additional information by applying the models we have worked out.

Just for the sake of an example, I will try to separate white and red wines in the full testing sample, and then divide the test observations into red and white exclusively, to see whether we can cluster by quality.

An important note here: while the wines should differ between red and white, because we saw substantial and statistically significant differences in the variable levels, it is not obvious that we can properly cluster the “good” and “bad” ones, as this grade is largely subjective and is not the result of any computational model that we could reverse-engineer.
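Note that eclust() below fits fresh clusters on the test sample. A stricter transfer test would assign each test observation to the nearest centroid learned during training (e.g. from `kmeans.euc.all$centers`); a minimal sketch with toy centroids (the helper name `assign.to.centroid` is mine):

```r
# Assign new observations to the nearest centroid learned on the
# training data, instead of re-fitting k-means on the test set.
assign.to.centroid = function(centers, newdata) {
  d = as.matrix(dist(rbind(centers, newdata)))
  k = nrow(centers)
  # distances from each new row to each of the k centroids
  apply(d[-(1:k), 1:k, drop = FALSE], 1, which.min)
}

centers = rbind(c(0, 0), c(5, 5))           # toy "trained" centroids
newdata = rbind(c(0.2, -0.1), c(4.8, 5.3))  # two new observations
assign.to.centroid(centers, newdata)        # assigns clusters 1 and 2
```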

kmeans.all.testing = eclust(wine.test[,2:12], "kmeans", hc_metric = "euclidean", k = 2)
fviz_cluster(kmeans.all.testing)
fviz_silhouette(kmeans.all.testing)
##   cluster size ave.sil.width
## 1       1  762          0.51
## 2       2  531          0.53

white.subset.test = wine.test %>%
  filter(type == "white")
kmeans.whi.testing = eclust(white.subset.test[,2:12], "kmeans", hc_metric = "euclidean", k = 2)
fviz_cluster(kmeans.whi.testing)
fviz_silhouette(kmeans.whi.testing)
##   cluster size ave.sil.width
## 1       1  405          0.44
## 2       2  557          0.54

# Repeat on the red-wine testing subset only
red.subset.test = wine.test %>%
  filter(type == "red")
kmeans.red.testing = eclust(red.subset.test[,2:12], "kmeans", hc_metric = "euclidean", k = 2)
fviz_cluster(kmeans.red.testing)
fviz_silhouette(kmeans.red.testing)
##   cluster size ave.sil.width
## 1       1  239          0.68
## 2       2   92          0.45

The results are surprisingly good, as we obtained even higher average silhouettes than during the training phase, which suggests the clusters fit the testing data even better. Let’s verify:

# Attach the cluster assignments back onto each testing data frame
wine.test[,"cluster"] = kmeans.all.testing$cluster
white.subset.test[,"cluster"] = kmeans.whi.testing$cluster
red.subset.test[,"cluster"] = kmeans.red.testing$cluster

Below we can see that the clustering worked well for the complete testing subset and did a good job of separating the wines. In more detail:

wine.test %>% filter(cluster == 2) %>% summary
##     type     fixed.acidity    volatile.acidity  citric.acid    
##  red  :313   Min.   : 4.800   Min.   :0.0800   Min.   :0.0000  
##  white:218   1st Qu.: 6.600   1st Qu.:0.2700   1st Qu.:0.1950  
##              Median : 7.400   Median :0.3900   Median :0.3000  
##              Mean   : 7.779   Mean   :0.4269   Mean   :0.2976  
##              3rd Qu.: 8.500   3rd Qu.:0.5700   3rd Qu.:0.4200  
##              Max.   :15.500   Max.   :1.5800   Max.   :0.7600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.700   Min.   :0.01300   Min.   : 3.00      
##  1st Qu.: 1.700   1st Qu.:0.04000   1st Qu.:10.00      
##  Median : 2.200   Median :0.06600   Median :17.00      
##  Mean   : 2.959   Mean   :0.06701   Mean   :18.02      
##  3rd Qu.: 2.950   3rd Qu.:0.08200   3rd Qu.:25.00      
##  Max.   :15.800   Max.   :0.41500   Max.   :54.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9884   Min.   :2.740   Min.   :0.2200  
##  1st Qu.: 30.50       1st Qu.:0.9925   1st Qu.:3.140   1st Qu.:0.4650  
##  Median : 64.00       Median :0.9953   Median :3.270   Median :0.5800  
##  Mean   : 61.39       Mean   :0.9949   Mean   :3.263   Mean   :0.5872  
##  3rd Qu.: 93.00       3rd Qu.:0.9972   3rd Qu.:3.380   3rd Qu.:0.6800  
##  Max.   :111.00       Max.   :1.0031   Max.   :4.010   Max.   :1.3600  
##     alcohol         quality         cluster 
##  Min.   : 8.40   Min.   :3.000   Min.   :2  
##  1st Qu.: 9.80   1st Qu.:5.000   1st Qu.:2  
##  Median :10.60   Median :6.000   Median :2  
##  Mean   :10.74   Mean   :5.804   Mean   :2  
##  3rd Qu.:11.50   3rd Qu.:6.000   3rd Qu.:2  
##  Max.   :14.00   Max.   :8.000   Max.   :2
wine.test %>% filter(cluster == 1) %>% summary
##     type     fixed.acidity    volatile.acidity  citric.acid    
##  red  : 18   Min.   : 4.400   Min.   :0.1050   Min.   :0.0000  
##  white:744   1st Qu.: 6.400   1st Qu.:0.2200   1st Qu.:0.2600  
##              Median : 6.800   Median :0.2600   Median :0.3200  
##              Mean   : 6.899   Mean   :0.2852   Mean   :0.3376  
##              3rd Qu.: 7.400   3rd Qu.:0.3200   3rd Qu.:0.3900  
##              Max.   :14.200   Max.   :0.9650   Max.   :0.9900  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.800   Min.   :0.01500   Min.   :  5.00     
##  1st Qu.: 2.100   1st Qu.:0.03800   1st Qu.: 29.00     
##  Median : 6.500   Median :0.04600   Median : 38.00     
##  Mean   : 7.263   Mean   :0.04915   Mean   : 39.61     
##  3rd Qu.:11.175   3rd Qu.:0.05275   3rd Qu.: 50.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :146.50     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :107.0        Min.   :0.9882   Min.   :2.830   Min.   :0.2900  
##  1st Qu.:129.0        1st Qu.:0.9924   1st Qu.:3.100   1st Qu.:0.4200  
##  Median :150.0        Median :0.9946   Median :3.180   Median :0.4800  
##  Mean   :155.3        Mean   :0.9948   Mean   :3.192   Mean   :0.5047  
##  3rd Qu.:178.0        3rd Qu.:0.9972   3rd Qu.:3.280   3rd Qu.:0.5600  
##  Max.   :307.5        Max.   :1.0390   Max.   :3.820   Max.   :1.9500  
##     alcohol         quality         cluster 
##  Min.   : 8.00   Min.   :3.000   Min.   :1  
##  1st Qu.: 9.30   1st Qu.:5.000   1st Qu.:1  
##  Median :10.00   Median :6.000   Median :1  
##  Mean   :10.27   Mean   :5.864   Mean   :1  
##  3rd Qu.:11.00   3rd Qu.:6.000   3rd Qu.:1  
##  Max.   :13.90   Max.   :8.000   Max.   :1
wine.test %>% group_by(type) %>% summarise(medianCluster = median(cluster),
                                           meanCluster = mean(cluster),
                                           stdCluster = sd(cluster))
## # A tibble: 2 x 4
##   type  medianCluster meanCluster stdCluster
##   <fct>         <dbl>       <dbl>      <dbl>
## 1 red               2        1.95      0.227
## 2 white             1        1.23      0.419
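The same comparison can be read more directly from a contingency table of color against cluster; a quick sketch using the `wine.test` frame with the `cluster` column attached as above:

```r
# Cross-tabulate wine color against the assigned cluster
tab <- table(wine.test$type, wine.test$cluster)
tab

# Cluster "purity": share of wines matching their cluster's majority color;
# with the counts from the summaries above, (744 + 313) / 1293, roughly 0.82
sum(apply(tab, 2, max)) / sum(tab)
```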

And below are the summaries of the White and Red testing subsets, which, in my subjective opinion, did worse at classifying quality:

white.subset.test %>% group_by(quality) %>% summarise(medianCluster = median(cluster),
                                                      meanCluster = mean(cluster),
                                                      stdCluster = sd(cluster))
## # A tibble: 6 x 4
##   quality medianCluster meanCluster stdCluster
##     <int>         <dbl>       <dbl>      <dbl>
## 1       3             1        1.25      0.5  
## 2       4             2        1.67      0.479
## 3       5             1        1.45      0.498
## 4       6             2        1.59      0.493
## 5       7             2        1.73      0.444
## 6       8             2        1.64      0.485
red.subset.test %>% group_by(quality) %>% summarise(medianCluster = median(cluster),
                                                    meanCluster = mean(cluster),
                                                    stdCluster = sd(cluster))
## # A tibble: 6 x 4
##   quality medianCluster meanCluster stdCluster
##     <int>         <dbl>       <dbl>      <dbl>
## 1       3             1        1         0    
## 2       4             1        1.25      0.463
## 3       5             1        1.39      0.490
## 4       6             1        1.20      0.405
## 5       7             1        1.13      0.341
## 6       8             1        1.25      0.5
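To quantify the (weak) agreement between quality grades and clusters beyond per-group means, an external index such as the adjusted Rand index can be used; below is a minimal base-R sketch (the same statistic is also available e.g. as mclust::adjustedRandIndex):

```r
# Adjusted Rand Index between two labelings: 1 = identical partitions,
# values near 0 = no better than chance agreement
adjusted_rand <- function(x, y) {
  tab <- table(x, y)
  a <- sum(choose(tab, 2))               # pairs agreeing in both labelings
  b <- sum(choose(rowSums(tab), 2))      # pairs together under x
  c <- sum(choose(colSums(tab), 2))      # pairs together under y
  n <- sum(tab)
  expected <- b * c / choose(n, 2)       # chance-level agreement
  (a - expected) / ((b + c) / 2 - expected)
}

adjusted_rand(white.subset.test$quality, white.subset.test$cluster)
adjusted_rand(red.subset.test$quality, red.subset.test$cluster)
```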

Summary

In conclusion, one can easily see that running such an analysis is not simple. At every stage you might wonder whether you made the right decision, and there are dozens of ways to evaluate or measure the effects of each decision. This Primier barely touches those methods. PCA, DBSCAN and kMeans Clustering have a very rich theoretical background, which I deliberately did not immerse into here, but which I definitely recommend getting familiar with.
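As one example of such an evaluation aid, factoextra can sweep over candidate values of k and plot the average silhouette width for each solution; a sketch assuming the numeric testing columns used above:

```r
# Average silhouette width for k = 2..10; a peak suggests a k worth trying
fviz_nbclust(wine.test[,2:12], kmeans, method = "silhouette", k.max = 10)
```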

As for the results, we obtained a number of different Principal Components, and we also tried to rotate the dimensions so as to maximize the variance explained by just 4 of them. Nevertheless, this did not capture enough variance to justify dropping the initial dataset and working on the artificial one. Moreover, most of the Components were built from almost every variable, which made the interpretation and explainability of the new features practically impossible. Later, during the clustering phase, we automatically reduced the number of dimensions to two in order to run DBSCAN and kMeans. DBSCAN showed that a lot of variance is still needed to fully understand the behaviour of the Wine Data: by limiting ourselves to a fixed number of Components, we lost a major part of our ability to differentiate the observations.

Outliers were a major challenge during this analysis, but for the sake of learning I decided to leave them in, in order to show why they might cause problems in certain analyses.

During the model evaluation we obtained results better than expected, which is actually a pity: worse testing than training performance would once more demonstrate that every phase matters, and that even when you fit the models as well as you can, problems can always surface during evaluation. Here, however, we obtained good results and were able to classify the data using only unsupervised learning methods, which is quite a challenge!